Fixes for forkserver/spawn serialization and fix for LMDB upgrade issues #148

Open

christinaflo wants to merge 5 commits into main from feature/multiprocessing-lmdb-refactor

Conversation

@christinaflo
Collaborator

Summary
Refactors LMDB handling to allow for forkserver/spawn serialization and resolves the issues that required pinning lmdb==1.6.2 for training.

Changes

  1. Exposes more dataloader options, such as persistent workers, prefetch factor, and multiprocessing context
  2. Lazily loads the CCD object to prevent serialization issues
  3. Modifies LMDB handling to allow serialization; the parent DB connection is closed prior to forking
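The serialization change in item 3 can be sketched as follows. This is a minimal illustration rather than the PR's actual code: `LazyDB` and its members are hypothetical names, and a plain file handle stands in for the lmdb environment.

```python
class LazyDB:
    """Sketch of the pattern: keep the live DB handle out of the pickled
    state and reopen it lazily in each worker process. A plain file handle
    stands in for the real lmdb environment (hypothetical example)."""

    def __init__(self, path):
        self.path = path
        self._handle = None  # opened on first access, never pickled

    @property
    def handle(self):
        if self._handle is None:
            # Stand-in for lmdb.open(self.path, ...)
            self._handle = open(self.path, "rb")
        return self._handle

    def close(self):
        # Close in the parent before forking so children reopen their own env.
        if self._handle is not None:
            self._handle.close()
            self._handle = None

    def __getstate__(self):
        state = self.__dict__.copy()
        state["_handle"] = None  # drop the unpicklable handle
        return state

    def __setstate__(self, state):
        self.__dict__.update(state)  # handle is recreated on first access
```

With this shape, `pickle.dumps(db)` succeeds even while the parent holds an open handle, and each forkserver/spawn worker opens its own environment on first read.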

Related Issues
PR #143 defaults to forkserver. This PR adds additional fixes needed for training with forkserver.

@christinaflo christinaflo requested review from jandom and jnwei March 26, 2026 02:32
@christinaflo christinaflo self-assigned this Mar 26, 2026
@sdvillal

sdvillal commented Apr 8, 2026

Minor comment: the DB preloading logic might be better handled by using vmtouch, which is available in conda-forge and often present on HPC systems, falling back to reading the whole DB file only when vmtouch is not available. Conceptually, this would only need to be done once per node, since the page cache is shared across processes.

If using vmtouch, one could optionally consider daemon mode and page locking to reduce the risk of eager eviction under memory pressure, although this may or may not be desirable depending on system limits and sharing policies (I assume we typically reserve full nodes?). Overall, I suspect the usefulness of this depends strongly on DB size vs memory size, access locality, and competition for memory.

I just found this, which could be vendored. I've never used it.
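A minimal sketch of the suggested fallback, assuming the DB lives at a single file path (`DB` and `warm_cache` are placeholder names):

```shell
#!/bin/sh
# Warm the OS page cache once per node: prefer vmtouch, fall back to a
# plain sequential read when it is not installed. DB is a placeholder path.
DB="${DB:-/path/to/data.lmdb}"

warm_cache() {
    if command -v vmtouch >/dev/null 2>&1; then
        vmtouch -t "$1"          # -t: touch (fault) every page into the cache
    else
        cat "$1" > /dev/null     # a sequential read pulls the pages in instead
    fi
}
```

For the daemon-mode/page-locking idea, `vmtouch -dl "$DB"` would daemonize and mlock the pages, but that needs appropriate memlock limits and may not be desirable on shared nodes.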

@jandom
Collaborator

jandom commented Apr 8, 2026

Taking a look at this now @christinaflo, because this came up in a number of different PRs, prominently these two

@jandom jandom self-assigned this Apr 8, 2026
Collaborator

@jandom jandom left a comment


I'm not sure we need this any more with the changes in #143 – the two small reads from LMDB are now released immediately (via context manager), and the persistent lmdb env stays open for the entire duration of the data loader (hopefully!)

but I'm not 100% confident – training tends to reveal more problems than running unit tests

Do you have any repro/test that I could run to confirm that my claim is accurate?

@christinaflo
Collaborator Author

These changes address a different issue than the one addressed in #143: the original version is not serializable under forkserver/spawn because of the persistent lmdb env (that is why the __getstate__ and __setstate__ methods are needed). Also, forking the persistent env across workers causes weird behavior/failures with lmdb versions > 1.6; it hasn't worked for me on multiple systems, which is why we pinned it. They added some documentation about this recently: https://lmdb.readthedocs.io/en/latest/#forking-multiprocessing
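The constraint being described can be checked directly: under forkserver/spawn, worker arguments cross a pickle boundary, so a quick round-trip test catches unserializable state before a real training run. A stdlib sketch, with a hypothetical `check_dataset_picklable` helper and a toy dataset standing in for the real one:

```python
import os
import pickle

def check_dataset_picklable(dataset):
    """Round-trip the dataset through pickle and return the copy, or raise.
    This is exactly what forkserver/spawn do implicitly when handing the
    dataset to each worker process (hypothetical helper for illustration)."""
    return pickle.loads(pickle.dumps(dataset))

class BadDataset:
    """Holds a live handle (like a persistent lmdb env) -> not picklable
    without __getstate__/__setstate__."""
    def __init__(self):
        self.handle = open(os.devnull, "rb")  # stand-in for an open env
```

Running `check_dataset_picklable` on an object like `BadDataset` raises `TypeError`, mirroring the failure a forkserver/spawn dataloader would hit at worker startup.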

@christinaflo
Collaborator Author

> Minor comment: the DB preloading logic might be better handled by using vmtouch, which is available in conda-forge and often present on HPC systems, falling back to reading the whole DB file only when vmtouch is not available. Conceptually, this would only need to be done once per node, since the page cache is shared across processes.
>
> If using vmtouch, one could optionally consider daemon mode and page locking to reduce the risk of eager eviction under memory pressure, although this may or may not be desirable depending on system limits and sharing policies (I assume we typically reserve full nodes?). Overall, I suspect the usefulness of this depends strongly on DB size vs memory size, access locality, and competition for memory.
>
> I just found this which could be vendored. I never used it.

I've never used it before; it isn't currently available on the cluster I'm using, but I'm happy to try it out via conda-forge. The biggest DB we have is around 10 GB. On one system, I did notice that I had to re-warm the cache between epochs, so this could be useful to avoid that.

@jandom jandom added the safe-to-test Internal only label used to indicate PRs that are ready for automated CI testing. label Apr 14, 2026